Despite its enormous progress in the last few decades, machine vision is still far from achieving the
goal that human vision attains with such speed and reliability: in David Marr's words, to "know
what is where by looking" (Marr, 1976). Recent results in the physiology and psychophysics of
visual  attention  accentuate  the  gap  between  machines  and  humans,  and  provide  a  first  step  to 
understanding  why  it  is  so  large  and  what  machines  must  learn  in  order  to  overcome  it. 

Paradoxically,  what  appear  to  be  the  simplest  tasks  for  humans  may  be  the  most  difficult  for 
machines.  Consider,  for  example,  recognizing  your  mother  in  a  sketch  of  her  sitting  in  the  kitchen. 
You  could  immediately  and  effortlessly  locate  her  face,  match  it  with  your  memory,  and 
pronounce  it  a  good  or  bad  likeness.  If  the  sketch  were  upside-down  you  could  easily  right  it  for  a 
proper  view.  You  would  probably  expend  the  most  painstaking  scrutiny  in  determining  just  which 
feature  was  slightly  off,  but  even  so,  your  final  judgment  would  be  quick.  In  contrast,  a  computer, 
using  the  most  sophisticated  face  recognition  routine,  would  perform  the  task  slowly  and 
incompletely,  because  it  would  not  know  where  to  start.  Given  the  location  of  the  two  eyes  in  a 
sketch cluttered with dark round blobs, the routine could then search for the mouth, nose and chin
at  the  appropriate  distances  and  methodically  match  each  feature  to  a  virtually  identical  image  in  its 
memory.  But  failing  to  find  the  eyes,  it  could  not  go  on  to  recognize  the  face. 

The difficulty of the face recognition problem, and more generally of object recognition, has called
into  question  one  of  the  main  assumptions  underlying  the  construction  of  a  machine  that  sees  as 
humans  do.  The  assumption  holds  that  the  goal  of  the  first  stages  in  vision  is  solely  to  determine 
"where"  things  are  -  that  is,  to  transform  the  initial  image,  an  array  of  intensity  values,  into  a  map 
of  the  scene  which  records  the  distance  and  orientation  of  each  surface  point  relative  to  the  viewer 
(the "2-1/2D sketch"). In machine vision the 2-1/2D sketch may serve to guide a mobile robot
around  an  obstacle  or  to  control  its  manipulations  as  it  picks  up  a  tool.  But,  like  the  raw  image 
from  which  it  is  computed,  the  2-1/2D  sketch  is  itself  simply  a  large  array  of  numbers.  Although 
it  may  contain  preliminary  information  for  object  recognition,  by  assigning  a  color  or  texture  to 
each  surface  point,  it  does  not  tell  "what"  things  are.  The  critical  task  in  object  recognition  is 
therefore to find the object or its crucial part within an array of intensity values or distances. Until
now,  many  of  the  attempts  to  elucidate  object  recognition  (reviewed  by  Besl  and  Jain,  1985  and 
by  Harmon  et  al.,  1979,  for  example)  have  assumed  that  the  relevant  object  is  already  located  and 
isolated  in  the  image. 




Unlike  machines,  humans  are  adept  in  spotting  the  salient  features  of  an  object.  To  understand  the 
mechanisms  underlying  this  ability,  psychophysicists  have  investigated  visual  attention.  Treisman 
(1983)  and  Julesz  (1984)  have  demonstrated  that  humans  are  extremely  efficient  in  detecting  a  part 
of an image that differs in a single aspect from its background. For example, a red dot pops
out in a field of yellow dots, and the same happens for a vertical line in a field of horizontal lines.
The  time  required  to  detect  the  unusual  item  is  independent  of  the  number  of  other  items,  implying 
that  the  search  for  it  occurs  in  parallel  across  the  entire  field.  The  human  visual  system  obviously 
possesses a fast, parallel mechanism which can direct attention to salient chunks of the image.*
Although the possible computational purposes of this mechanism have not been probed by
psychophysical experiments, its potential role in object recognition seems critical. For example, in
face recognition the attention mechanism may perform two essential steps: first, to locate "blobs"
which could be eyes; and second, to direct processing toward the blobs to verify that they are eyes
and thereby to initiate recognition. The role of attention, therefore, may be not only to spotlight
distinctive parts of the image but, more importantly, to segment the image into objects or parts of
objects, a crucial first step in determining what things are.

* This mechanism is sometimes called "preattentive." Here, we consider it as part of the entire
attention mechanism, whose characteristics probably require more complex descriptions than
"serial" or "parallel."
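The two steps proposed for face recognition can be made concrete with a small sketch. It is only an illustration under stated assumptions, not a description of how the attention mechanism or any existing face recognition routine works: the image is a tiny grid of intensities, a "blob" is taken to be a 4-connected patch of dark pixels, and the verification step is replaced by a hypothetical geometric check (two blobs of similar size at about the same height). The function names, the threshold and the tolerances are all invented for the example.

```python
def find_dark_blobs(image, threshold):
    """Step 1: find 4-connected patches of pixels darker than `threshold`.

    Returns one (row_centroid, col_centroid, size) tuple per patch.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] >= threshold:
                continue
            # Flood-fill the patch that contains (r, c).
            stack, pixels = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and image[ny][nx] < threshold:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            cy = sum(p[0] for p in pixels) / len(pixels)
            cx = sum(p[1] for p in pixels) / len(pixels)
            blobs.append((cy, cx, len(pixels)))
    return blobs


def eye_pair_candidates(blobs, max_row_offset=1.0, max_size_ratio=1.5):
    """Step 2 (hypothetical check): keep pairs of blobs that are roughly
    level with each other and of similar size. Both tolerances are arbitrary."""
    pairs = []
    for i in range(len(blobs)):
        for j in range(i + 1, len(blobs)):
            (y1, _x1, s1), (y2, _x2, s2) = blobs[i], blobs[j]
            level = abs(y1 - y2) <= max_row_offset
            similar = max(s1, s2) / min(s1, s2) <= max_size_ratio
            if level and similar:
                pairs.append((blobs[i], blobs[j]))
    return pairs


# A toy "sketch": 9 is light paper, 0 is dark ink.
sketch = [
    [9, 9, 9, 9, 9, 9, 9, 9],
    [9, 0, 0, 9, 9, 0, 0, 9],
    [9, 0, 0, 9, 9, 0, 0, 9],
    [9, 9, 9, 9, 9, 9, 9, 9],
    [9, 9, 9, 9, 9, 9, 9, 9],
    [9, 9, 0, 0, 0, 0, 9, 9],
]
blobs = find_dark_blobs(sketch, threshold=5)
print(eye_pair_candidates(blobs))
```

In this toy image three dark blobs are found, but only the two upper ones pass the pairing check, so later processing would be directed there first. The point is the division of labor, locating candidates cheaply and then examining a few of them, rather than the particular test used.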

An  important  and  still  open  question  is:  what  are  the  features  or  primitives  that  drive  attention? 
Likely  candidates  are  separable  features  which,  by  definition,  can  be  attended  to  selectively  and 
are  processed  independently  and  in  parallel.  Pop-out  and  texture  discrimination  experiments 
provide  a  test  for  separable  features  and  so  far  have  diagnosed  color,  line  orientation,  line  ends 
(terminators)  and  possibly  crossings  as  candidates. 


Conjunction  experiments  test  whether  two  or  more  separable  features  may  combine  to  produce  a 
higher-order primitive. For example, when a green T in a field of randomly mixed green Xs and
brown Ts is the target, it does not pop out, and the time required to detect it increases linearly with
the number of background items. Thus the detection of a particular conjunction of color and shape
appears  to  require  a  search  over  each  item  in  turn,  across  the  entire  field.  Conjunction  experiments 
thus  reveal  another  aspect  of  the  attention  mechanism,  a  serial  searchlight  which  appears  to 
operate  independently  of  eye  movements  and  does  for  feature  conjunctions  what  the  parallel 
mechanism  does  for  features. 
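The contrast between pop-out and conjunction search can be illustrated with a toy cost model. The following is a minimal sketch under stated assumptions, not a model of reaction times or of the underlying neural mechanism: items are (color, shape) pairs, the parallel stage is assumed to flag any item whose value is unique within a single feature map at a cost of one step regardless of field size, and the serial searchlight is assumed to inspect items one at a time in random order. All function names are invented for the example.

```python
import random
from collections import Counter

TARGET = ("green", "T")


def make_display(n_items, conjunction, rng):
    """Build a field of (color, shape) items containing exactly one target."""
    if conjunction:
        # Each distractor shares one feature with the target.
        kinds = [("green", "X"), ("brown", "T")]
    else:
        # Distractors differ from the target in color only.
        kinds = [("brown", "T")]
    items = [kinds[i % len(kinds)] for i in range(n_items - 1)]
    items.append(TARGET)
    rng.shuffle(items)
    return items


def parallel_popout(items):
    """Return the index of an item whose color or shape is unique, else None."""
    for dim in (0, 1):  # 0: color map, 1: shape map
        values = [item[dim] for item in items]
        counts = Counter(values)
        unique = [i for i, v in enumerate(values) if counts[v] == 1]
        if len(unique) == 1:
            return unique[0]
    return None


def serial_search(items, rng):
    """Inspect items one at a time in random order; return (index, inspections)."""
    order = list(range(len(items)))
    rng.shuffle(order)
    for inspections, i in enumerate(order, start=1):
        if items[i] == TARGET:
            return i, inspections
    return None, len(order)


def search(items, rng):
    """Try the parallel stage first and fall back to the serial searchlight."""
    hit = parallel_popout(items)
    if hit is not None:
        return hit, 1  # one field-wide step, whatever the field size
    return serial_search(items, rng)


def mean_cost(n_items, conjunction, trials=2000, seed=0):
    """Average search cost over many random displays of a given size."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        items = make_display(n_items, conjunction, rng)
        index, cost = search(items, rng)
        assert items[index] == TARGET
        total += cost
    return total / trials


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, mean_cost(n, conjunction=False), mean_cost(n, conjunction=True))
```

With fields of 8, 16 and 32 items, the color-only displays cost one step each, while the conjunction displays cost about (n + 1) / 2 inspections on average, which grows linearly with the number of items. The field sizes are chosen so that each distractor type occurs at least twice; otherwise a distractor could itself be unique in one map.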


Until  recently,  all  conjunctions  between  known  separable  features  had  been  shown  to  require  the 
serial  searchlight.  The  recent  results  of  Nakayama  and  Silverman  (1986)  reveal  a  surprising 
exception  to  this  pattern.  In  pop-out  experiments  using  fields  of  small  rectangular  patterns 
displayed  on  a  color  television  monitor,  the  authors  demonstrate  that  binocular  disparity  and 
motion  individually  behave  as  separable  features.  The  conjunction  of  motion  and  color  does  not; 
the search for a pattern of blue upward-moving dots is slow and serial across a field of
blue-downward  patterns  and  red-upward  patterns.  Contrary  to  this  trend,  conjunctions  of 
binocular  disparity  and  either  color  or  motion  behave  as  separable  features:  they  are  searched  for  in 
parallel.  (The  authors  report  that  when  the  field  splits  into  two  planes,  one  in  front  of  the  other,  the 
search  for  a  conjunction  amounts  to  a  pop-out  of  the  unusual  item  in  one  plane.  Thus  we  suggest 
that  it  may  be  possible,  using  a  different  kind  of  motion  stimulus,  to  create  separate  planes  of 
coherent  motion  and  thereby  induce  a  parallel  search  for  motion-color  conjunctions.) 
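The reported observation, that a conjunction search becomes a pop-out once the field splits into depth planes, can be sketched as follows. This is only an illustration under assumptions that are not in the original experiments: disparity is reduced to a categorical label ("near" or "far"), color is the second feature, and the function name is hypothetical. Over the whole field no single feature value is unique, but within the near plane the target's color is.

```python
from collections import Counter, defaultdict


def popout_by_plane(items):
    """Return indices of items whose color is unique within their own depth plane.

    `items` is a list of (disparity, color) pairs.
    """
    planes = defaultdict(list)
    for index, (disparity, _color) in enumerate(items):
        planes[disparity].append(index)

    hits = []
    for indices in planes.values():
        counts = Counter(items[i][1] for i in indices)
        # A lone item in a plane has nothing to stand out against, so skip it.
        if len(indices) > 1:
            hits.extend(i for i in indices if counts[items[i][1]] == 1)
    return sorted(hits)


# Distractors: near-blue and far-red. Target: near-red (a conjunction).
field = [("near", "blue")] * 4 + [("far", "red")] * 4
field.insert(2, ("near", "red"))

# Over the whole field neither color is unique (red: 5, blue: 4).
whole_field = Counter(color for _, color in field)

print(popout_by_plane(field))  # → [2]
```

Segregating by plane first turns the conjunction target into a single-feature oddity, which is the sense in which the search "amounts to a pop-out of the unusual item in one plane."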


The  psychophysical  studies  on  separable  features  coincide  with  the  recent  emphasis  on  functional 
localization in visual neurophysiology and neuroanatomy. It is tempting to draw an explicit
connection between biology and psychophysics by equating different visual cortical areas with
different feature maps; for example, calling V4 the color map, MT the motion map, and V1 the
orientation map. The psychophysics would suggest that in a given feature map, at each spatial
location there exists a collection of neurons each tuned to a different value of the feature (e.g. red,
green or blue for color). Although such an organization has not been demonstrated, the evidence
for segregation of functionally similar neurons in distinct cortical areas is steadily accumulating.
From this point of view, the results of Nakayama and Silverman have interesting implications for
neurons and feature maps: they preclude the existence of neurons tuned for both motion and color,
and predict the existence of neurons tuned to a particular combination of binocular disparity and
motion and of neurons tuned to disparity and color. The results also suggest that feature maps may
be replicated at each of several disparity planes. The prediction of disparity-motion tuned neurons
is  supported  by  Maunsell  and  van  Essen's  (1983)  report  of  similar  neurons  in  cortical  area  MT. 

Other recent work in visual neurophysiology puts the emphasis on a different aspect of attention.
Rather than address computational questions such as "what are the salient features?" and "how
does the attention mechanism work?" or the psychophysical question "is feature processing parallel
or serial?", the new class of physiology experiments on alert animals seeks to demonstrate the ways
in which attention can modulate neuronal responses. In the course of such experiments, insights
into  the  neural  circuitry  and  anatomical  location  of  the  attention  mechanism  have  emerged.  For 
example,  based  on  studies  of  attention-mediated  modulation  in  the  inferior  parietal  lobe  (area  7), 
Lynch  et  al.  (1977)  have  proposed  that  neurons  there  are  responsible  for  directing  attention  to 
visual  targets. 

More  recent  research  has  demonstrated  the  effects  of  attention  at  other  levels  in  the  visual  pathway. 
Moran  and  Desimone  (1985)  have  recently  shown  that,  in  the  monkey,  the  response  of  a  neuron  in 
V4  or  IT  to  a  preferred  stimulus  (for  example,  a  red  horizontal  bar)  is  dramatically  reduced  when 
the  animal  ignores  it  and  instead  attends  to  an  ineffective  stimulus  (such  as  a  green  vertical  bar) 
within  the  same  receptive  field  (which,  for  IT  neurons,  may  extend  at  least  12°).  The  response  of 
the  neuron  to  the  preferred  stimulus  is  unaffected  when  the  attended  stimulus  is  outside  its  receptive 
field.  Thus  V4  and  IT  neurons  are  able  to  filter  out  an  irrelevant  stimulus  when  it  competes  with  a 
relevant stimulus within the same receptive field. V1 neurons do not have this property, and the
monkey cannot even perform the differential attention task when the two stimuli are close enough
to fit within a single receptive field in V1.


A  recent  psychophysical  experiment  in  humans  by  Sagi  and  Julesz  (1986)  provides  an  intriguing 
complement to these physiological results. Sagi and Julesz find that visual attention directed to a
random  location  for  an  orientation  discrimination  task  enhances  the  detection  of  a  test  flash 
presented  simultaneously  within  a  certain  radius  of  the  target.  The  area  of  enhancement,  which  the 
authors  conjecture  to  be  the  area  covered  by  the  searchlight  of  attention,  varies  from  1.5°  at  2° 
eccentricity  to  about  3°  at  4°  eccentricity.  Interestingly,  these  areas  are  likely  to  be  larger  than  the 
average receptive field sizes in V1.

The  above  results  imply  that  attention  to  one  region  of  an  image  may  involve  both  suppression  of 
visual  processing  in  irrelevant  regions  and  enhancement  of  visual  processing  in  relevant  regions. 
Thus  attention  may  indeed  be  responsible  for  directing  a  processing  focus  to  specific  locations  in 
the  initial  steps  of  recognition.  Yet  although  biological  research  may  have  found  the  key  to  machine 
vision,  it  has  yet  to  describe  how  it  opens  the  lock. 

Computational  results  suggest  that  the  attention  mechanism  may  be  even  more  complex  and 
powerful  than  experiments  have  revealed.  Consider  again  the  face  recognition  problem.  Individual 
features  such  as  eyes  or  the  curved  line  of  nose  and  mouth  can  by  themselves  lead  to  the  hypothesis 
of  a  face  (see  figure  2a).  In  contrast,  as  figure  2b  shows,  features  alone  cannot  be  the  only  cue  for 
recognition.  The  spatial  relationship  between  the  two  eye  tokens  and  the  closed  outer  contour  can 
also  cue  the  face  recognition  process.  Ullman  (1984)  has  argued  cogently  that  spatial  relations 
must  be  computed  by  a  mechanism  similar  to  the  serial  searchlight  of  attention. 

The  unraveling  of  the  full  complexity  of  visual  attention  will  clearly  involve  computational, 
psychophysical, and physiological research and in turn will influence not only our understanding of
visual  perception  but  also  the  architecture  and  the  control  structure  of  machine  vision  systems. 

Figure 2. In (a) each separate set of "face" features is sufficient to suggest the hypothesis of a face.
In (b) it is the spatial relation between features, and not the features themselves, that cues
the face hypothesis.